Conversation
- Add --log-mode verbose and --log-filter * to record-trace invocation - Set DOTNET_DiagnosticServerDiag=1 on tracee process for runtime logging - Add fprintf diagnostic logging to ds-server.c (IPC command handling) - Add fprintf diagnostic logging to ep-session-provider.c (tracepoint registration) - Add fprintf diagnostic logging to ep-session.c (tracepoint write) Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
…/arm64 Disable all CI pipeline jobs in runtime.yml except those needed to run CoreCLR runtime tests on linux-x64 and linux-arm64 for investigating flaky userevents tests (issue #123442). Kept jobs: - EvaluatePaths stage - Libraries_CheckedCoreCLR build for linux_arm64 - Libraries_CheckedCoreCLR build for linux_x64 - CoreCLR test build (build-test-job.yml with CoreClrTestBuildHost) - CoreCLR test execution (run-test-job.yml for linux_x64 and linux_arm64) All other ~63 job blocks are disabled with condition: false. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per the one-collect commandline.rs API, --log-mode accepts 'disabled', 'console', or 'file', and --log-filter accepts Rust tracing EnvFilter strings like 'trace', 'debug', 'info'. Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Change traceValidator signature to Func<int, EventPipeEventSource, bool> to pass the tracee process ID. Each scenario validator now filters events by ProcessID and logs a count of events from other processes. Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR adds additional diagnostic logging to help investigate flakiness in the src/tests/tracing/userevents functional tests (issue #123442), including filtering/annotating expected events by the tracee PID and enabling extra runtime diagnostics during test runs.
Changes:
- Pass the tracee PID into per-scenario trace validators and ignore events from other processes in the trace validation.
- Enable extra diagnostic output in the tracee runtime and in
record-traceinvocation to aid debugging. - Add temporary diagnostic logging in native EventPipe/DiagnosticServer code paths, and temporarily alter/disable large parts of the Azure Pipelines runtime matrix for investigation.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/tests/tracing/userevents/multithread/multithread.cs | Filters validation to tracee PID and improves failure diagnostics. |
| src/tests/tracing/userevents/managedevent/managedevent.cs | Filters validation to tracee PID and improves failure diagnostics. |
| src/tests/tracing/userevents/custommetadata/custommetadata.cs | Filters validation to tracee PID and improves failure diagnostics. |
| src/tests/tracing/userevents/basic/basic.cs | Filters validation to tracee PID and improves failure diagnostics. |
| src/tests/tracing/userevents/activity/activity.cs | Filters validation to tracee PID and improves failure diagnostics. |
| src/tests/tracing/userevents/common/UserEventsTestRunner.cs | Updates validator signature, adds record-trace logging args, and enables runtime diag env var for tracee. |
| src/native/eventpipe/ep-session.c | Adds diagnostic logging around user-events tracepoint write path. |
| src/native/eventpipe/ep-session-provider.c | Adds diagnostic logging around tracepoint registration. |
| src/native/eventpipe/ds-server.c | Adds diagnostic logging around diagnostic server IPC handling. |
| eng/pipelines/runtime.yml | Temporarily disables/narrows many CI legs for investigation. |
| # TEMPORARY: disabled for userevents diagnostic investigation (https://github.com/dotnet/runtime/issues/123442) | ||
| condition: false |
There was a problem hiding this comment.
This change sets the job condition to false, effectively disabling this pipeline leg for all PRs/rolling builds. Even if intended for short-term investigation, it will prevent CI from validating unrelated changes in this pipeline and is very risky to merge. Please gate this behind a dedicated variable/parameter (e.g., a temporary feature flag) or revert before merging.
| # TEMPORARY: narrowed to linux_x64 only for userevents diagnostic investigation (https://github.com/dotnet/runtime/issues/123442), was also windows_x64 | ||
| - linux_x64 |
There was a problem hiding this comment.
These platform matrix changes remove previously covered OSes (e.g., Windows/macOS) for this leg. That materially reduces test coverage and can hide regressions unrelated to the userevents investigation. Please avoid narrowing platforms in the main pipeline, or gate it behind a temporary variable so normal PR validation coverage is preserved.
| traceeStartInfo.Environment["DOTNET_DiagnosticServerDiag"] = "1"; | ||
|
|
There was a problem hiding this comment.
DOTNET_DiagnosticServerDiag is being unconditionally enabled for every tracee run. This can significantly increase stderr output and potentially affect timing/behavior of the scenario under test, making results harder to interpret once the investigation is done. Consider enabling it only when a separate opt-in env var/command-line switch is set (or only under Helix/CI) so local/normal runs stay quiet by default.
| traceeStartInfo.Environment["DOTNET_DiagnosticServerDiag"] = "1"; | |
| // This is opt-in locally via DOTNET_UserEvents_EnableDiagnosticServerDiag=1, and defaults to on under Helix. | |
| bool enableDiagnosticServerDiag = | |
| string.Equals(Environment.GetEnvironmentVariable("DOTNET_UserEvents_EnableDiagnosticServerDiag"), "1", StringComparison.Ordinal) || | |
| !string.IsNullOrEmpty(Environment.GetEnvironmentVariable("HELIX_WORKITEM_ROOT")); | |
| if (enableDiagnosticServerDiag) | |
| { | |
| traceeStartInfo.Environment["DOTNET_DiagnosticServerDiag"] = "1"; | |
| } |
| static int ep_sess_diag_enabled = -1; // -1 = uninitialized | ||
| static inline int ep_sess_diag_is_enabled (void) { | ||
| if (ep_sess_diag_enabled == -1) { | ||
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | ||
| ep_sess_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0; | ||
| } | ||
| return ep_sess_diag_enabled; |
There was a problem hiding this comment.
ep_sess_diag_enabled is lazily initialized without any synchronization. session_tracepoint_write_event can be called from multiple threads, so concurrent first-use can cause a data race/UB in C. Consider using an atomic/once-init pattern (e.g., store the flag in a volatile int64_t and initialize via ep_rt_atomic_compare_exchange_int64_t, or always compute from getenv without caching) to make this thread-safe.
| static int ep_sess_diag_enabled = -1; // -1 = uninitialized | |
| static inline int ep_sess_diag_is_enabled (void) { | |
| if (ep_sess_diag_enabled == -1) { | |
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | |
| ep_sess_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0; | |
| } | |
| return ep_sess_diag_enabled; | |
| static inline int ep_sess_diag_is_enabled (void) { | |
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | |
| return (val != NULL && val[0] == '1') ? 1 : 0; |
| if (bytes_written == -1) { | ||
| EP_SESS_DIAG_LOG ("tracepoint_write_event: writev failed for provider='%s', event_id=%u, fd=%d, errno=%d (%s).\n", ep_provider_get_provider_name (provider), (unsigned)ep_event_get_event_id (ep_event), session->user_events_data_fd, errno, strerror (errno)); | ||
| } else { | ||
| EP_SESS_DIAG_LOG ("tracepoint_write_event: writev success for provider='%s', event_id=%u, bytes_written=%zd.\n", ep_provider_get_provider_name (provider), (unsigned)ep_event_get_event_id (ep_event), bytes_written); | ||
| } |
There was a problem hiding this comment.
This code calls strerror(errno) but ep-session.c doesn't include the header that declares strerror (typically <string.h>). With warnings-as-errors this can fail the build due to an implicit declaration. Please include the appropriate header (and/or follow the repo’s existing include pattern for strerror).
| static int ep_sp_diag_enabled = -1; // -1 = uninitialized | ||
| static inline int ep_sp_diag_is_enabled (void) { | ||
| if (ep_sp_diag_enabled == -1) { | ||
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | ||
| ep_sp_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0; | ||
| } | ||
| return ep_sp_diag_enabled; |
There was a problem hiding this comment.
ep_sp_diag_enabled is lazily initialized without synchronization. If this file’s logging is reachable from multiple threads during session setup/teardown, the first-use initialization can race and is undefined behavior in C. Please use an atomic/once-init pattern (or avoid caching) so the env-var check is thread-safe.
| static int ep_sp_diag_enabled = -1; // -1 = uninitialized | |
| static inline int ep_sp_diag_is_enabled (void) { | |
| if (ep_sp_diag_enabled == -1) { | |
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | |
| ep_sp_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0; | |
| } | |
| return ep_sp_diag_enabled; | |
| static inline int ep_sp_diag_is_enabled (void) { | |
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | |
| return (val != NULL && val[0] == '1') ? 1 : 0; |
| static int ds_diag_enabled = -1; // -1 = uninitialized | ||
| static inline int ds_diag_is_enabled (void) { | ||
| if (ds_diag_enabled == -1) { | ||
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | ||
| ds_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0; | ||
| } | ||
| return ds_diag_enabled; |
There was a problem hiding this comment.
ds_diag_enabled is lazily initialized without synchronization. If multiple threads can reach this logging (e.g., server loop interactions), the first-use initialization can race and is undefined behavior in C. Please switch to an atomic/once-init pattern (or avoid caching) to make the env-var check thread-safe.
| static int ds_diag_enabled = -1; // -1 = uninitialized | |
| static inline int ds_diag_is_enabled (void) { | |
| if (ds_diag_enabled == -1) { | |
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | |
| ds_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0; | |
| } | |
| return ds_diag_enabled; | |
| static inline int | |
| ds_diag_is_enabled (void) | |
| { | |
| const char *val = getenv ("DOTNET_DiagnosticServerDiag"); | |
| return (val != NULL && val[0] == '1') ? 1 : 0; |
The disabled osx_arm64 test run job depends on a Libraries_CheckedCoreCLR build for osx_arm64 existing. When osx_arm64 was removed from the build platforms list, Azure Pipelines YAML validation fails because the dependency cannot be resolved (even though the test job has condition: false). Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Remove osx_arm64 from the Libraries_CheckedCoreCLR build platforms (only linux-x64 and linux-arm64 needed for userevents investigation). Also remove the osx_arm64 test job block entirely since setting condition: false does not skip dependency validation in Azure Pipelines YAML - the build job must still exist for the dependency to resolve. Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
- Remove CLRTestTargetUnsupported from userevents Directory.Build.props (keep Mono exclusion only) - Disable PR triggers for runtime-linker-tests, hardware-intrinsics, global-build (dev-innerloop), and runtime-diagnostics pipelines - All browser-wasm jobs in runtime.yml already had condition: false Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
The diagnostic logging at line 896 calls strerror(errno) but the file did not include <string.h> which declares strerror. This could cause an implicit function declaration warning/error depending on compiler settings. Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Investigating #123442